Intro
Definition (mean): The sample mean of observed values $x_1, \ldots, x_n \in \mathbb{R}$ is
$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i.$$
The word "sample" distinguishes it from the mean/expectation in probability theory.
Definition (median): The sample median of observed values is
$$\operatorname{med}(x_1, \ldots, x_n) = \begin{cases} x_{(n+1)/2}, & n \text{ odd}, \\ \tfrac{1}{2}\big(x_{n/2} + x_{n/2+1}\big), & n \text{ even}, \end{cases}$$
with $x_1 \leq x_2 \leq \ldots \leq x_n$ being the sorted data points.
If the number of data points is odd, take the single middle value; if it is even, take the average of the two middle values.
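A minimal Python sketch (using made-up data) that computes both statistics exactly as defined above, including the odd/even rule for the median:

```python
import numpy as np

def sample_mean(x):
    # arithmetic average of the observations
    return sum(x) / len(x)

def sample_median(x):
    # sort the data, then apply the odd/even rule from the definition
    s = sorted(x)
    n = len(s)
    if n % 2 == 1:                           # odd: the single middle value
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2   # even: average of the two middle values

x = [3.1, -0.4, 2.2, 5.0, 1.7]               # hypothetical observations
print(sample_mean(x), sample_median(x))
# cross-check against numpy's built-ins
print(np.mean(x), np.median(x))
```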
Definition (Statistical model): A statistical model is a family $\mathcal{P}$ of probability distributions, regarded as the candidate distributions that could have generated the observed data.
Definition (Parameter): A (statistical) parameter of a statistical model $\mathcal{P}$ is a map $\gamma : \mathcal{P} \to \text{ some set } T$.
Examples:
- Mean/expectation
- Variance
- Correlations
Construction of Estimators
Definition (Estimator): An estimator is a function that maps data to estimates of quantities of interest.
Simply put, an estimator is a function whose input is the data; the quantity of interest one wants to estimate is called the estimand, and the values the function outputs are called estimates.
Plug-in Estimator
Definition (empirical distribution): The empirical distribution of $x_1, \ldots, x_n \in \mathbb{R}$ is the probability distribution $\hat{P}_n$ given by
$$\hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i},$$
where $\delta_x$ denotes the point mass (Dirac measure) at $x$.
This is a discrete probability distribution.
Definition (empirical distribution function (ecdf)): The empirical distribution function (ecdf) of $x_1, \ldots, x_n$ is the distribution function of $\hat{P}_n$, which is
$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{x_i \leq t\}, \qquad t \in \mathbb{R}.$$
This is a step function defined on all of $\mathbb{R}$ (right-continuous, with jumps of size $1/n$ at the data points).
Theorem (Glivenko-Cantelli): If $X_1, X_2, \ldots$ are i.i.d. random variables with cdf (cumulative distribution function) $F$, then
$$\sup_{t \in \mathbb{R}} \big| \hat{F}_n(t) - F(t) \big| \longrightarrow 0 \quad \text{a.s. as } n \to \infty.$$
In other words, once we have enough samples, this estimator converges to the true distribution function.
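As an illustration (not part of the original notes), the following sketch estimates the sup-distance between the ecdf of standard normal samples and the true normal cdf for increasing $n$; the distance should shrink toward 0:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def ecdf_sup_distance(n):
    # draw n i.i.d. N(0,1) samples and evaluate sup_t |F_n(t) - F(t)|
    x = np.sort(rng.standard_normal(n))
    # the sup is attained just before or at a jump of the ecdf,
    # so it suffices to check both sides of each jump point
    fn_right = np.arange(1, n + 1) / n
    fn_left = np.arange(0, n) / n
    f = norm.cdf(x)
    return max(np.max(np.abs(fn_right - f)), np.max(np.abs(fn_left - f)))

for n in [10, 100, 1000, 10000]:
    print(n, ecdf_sup_distance(n))
```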
Theorem: If $U \sim \mathrm{Unif}(0,1)$, then $X := F^{-1}(U)$ has cdf $F$, where $F^{-1}(u) = \inf\{x : F(x) \geq u\}$ is the (generalized) inverse, i.e., the quantile function.
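A quick sketch of this inverse-transform idea, using the exponential distribution as an assumed example (cdf $F(x) = 1 - e^{-x}$, so $F^{-1}(u) = -\log(1-u)$):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=100_000)        # U ~ Unif(0, 1)
x = -np.log(1.0 - u)                 # X = F^{-1}(U) for F(x) = 1 - exp(-x)

# the transformed samples should behave like Exp(1): mean 1, variance 1
print(x.mean(), x.var())
```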
Definition: The plug-in estimator of $\gamma(F)$ is the estimator $\hat{\gamma} = \gamma(\hat{F}_n)$.
Example:
Consider the mean $\gamma(F) = \int x \, dF(x)$. Then
$$\gamma(\hat{F}_n) = \int x \, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n X_i = \bar{X}_n,$$
i.e., the plug-in estimator of the mean is the sample mean.
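As a sketch of my own, computing a plug-in estimate just means applying the functional to the empirical distribution, i.e., replacing integrals with respect to $F$ by averages over the data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # hypothetical sample

# plug-in estimate of the mean: integral of t dF_n(t) = average of the data
mean_hat = x.mean()

# plug-in estimate of the variance: integral of (t - mean)^2 dF_n(t)
var_hat = np.mean((x - mean_hat) ** 2)

# plug-in estimate of P(X <= 4): F_n(4), the fraction of data points <= 4
prob_hat = np.mean(x <= 4.0)

print(mean_hat, var_hat, prob_hat)
```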
M-Estimator
Definition (M-estimator): An estimator $\hat{\theta}(X_1, \ldots, X_n)$ maximizing a criterion function of the form
$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^n m_\theta(X_i),$$
where $m_\theta$ is a known function, is called an M-estimator (maximum-likelihood type).
Examples:
- For $\theta \in \mathbb{R}$, choosing $m_\theta(x) = -(x - \theta)^2$ yields the sample mean $\bar{X}_n$.
- Choosing $m_\theta(x) = -|x - \theta|$ yields the sample median. (Both are verified numerically in the sketch below.)
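A minimal numerical sketch (my own, with simulated data and scipy's scalar optimizer) that maximizes the two criterion functions above and compares the results with the sample mean and median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=201)      # hypothetical data

# M-estimation: maximize sum_i m_theta(x_i), i.e. minimize the negative criterion
def m_estimate(m_fn):
    res = minimize_scalar(lambda theta: -np.sum(m_fn(x, theta)),
                          bounds=(x.min(), x.max()), method="bounded")
    return res.x

# m_theta(x) = -(x - theta)^2  ->  sample mean
theta_mean = m_estimate(lambda x, t: -(x - t) ** 2)
# m_theta(x) = -|x - theta|    ->  sample median
theta_median = m_estimate(lambda x, t: -np.abs(x - t))

print(theta_mean, np.mean(x))       # should agree
print(theta_median, np.median(x))   # should agree (up to optimizer tolerance)
```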
Method of Moments (MOM)
Given a parametric model for real-valued observations
$$X_1, \ldots, X_n \ \text{i.i.d.} \sim P_\theta, \qquad \theta \in \Theta \subseteq \mathbb{R}^k,$$
consider the moments
$$m_j(\theta) = \mathbb{E}_\theta[X_1^j], \qquad j = 1, \ldots, k.$$
If it exists, then the $j$-th moment may be estimated by the empirical moment
$$\hat{m}_j = \frac{1}{n} \sum_{i=1}^n X_i^j.$$
Definition: The MOM estimator $\hat{\theta}$ is the value of $\theta$ that solves the equation system
$$m_j(\hat{\theta}) = \hat{m}_j, \qquad j = 1, \ldots, k.$$
Example (Gaussian):
Suppose $P_\theta = \mathcal{N}(\mu, \sigma^2)$ with mean and variance unknown, so $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)$.
The density:
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
The system of equations to solve:
$$m_1(\theta) = \mu = \frac{1}{n} \sum_{i=1}^n X_i, \qquad m_2(\theta) = \mu^2 + \sigma^2 = \frac{1}{n} \sum_{i=1}^n X_i^2.$$
Solving gives the sample mean and the empirical variance:
$$\hat{\mu} = \bar{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
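A small numerical check (my own sketch, with invented true parameters) of these MOM formulas against simulated Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma2_true = 1.5, 4.0
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=10_000)

# empirical moments
m1_hat = np.mean(x)        # estimates E[X] = mu
m2_hat = np.mean(x ** 2)   # estimates E[X^2] = mu^2 + sigma^2

# solve the moment equations: mu = m1_hat, mu^2 + sigma^2 = m2_hat
mu_hat = m1_hat
sigma2_hat = m2_hat - m1_hat ** 2   # equals the empirical variance

print(mu_hat, sigma2_hat)           # should be close to 1.5 and 4.0
```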
Maximum Likelihood Estimator (MLE)
Consider a parametric model for the observation
$$X \sim P_\theta, \qquad \theta \in \Theta.$$
Assume the model $\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$ is dominated by a $\sigma$-finite measure $\nu$, i.e., $P_\theta \ll \nu$ for all $\theta \in \Theta$, and so we have densities
$$p_\theta = \frac{dP_\theta}{d\nu}, \qquad \theta \in \Theta.$$
(That such densities exist is a result from probability theory, namely the Radon-Nikodym theorem.)
Definition: The function $L_x(\theta) = p_\theta(x)$ is the likelihood function of model $\mathcal{P}$ for the data $x$.
If a density function is available, just plug it in directly as the likelihood.
Definition: The maximum likelihood estimate (MLE) of $\theta$ is
$$\hat{\theta}(x) = \underset{\theta \in \Theta}{\arg\max}\; L_x(\theta).$$
If $\hat{\theta}(X)$ is a measurable function of the observation $X$, then $\hat{\theta}(X)$ is called the maximum likelihood estimator (MLE) of $\theta$.
In practice, however, one usually works with the so-called log-likelihood function
$$l_x(\theta) = \log L_x(\theta).$$
This has two advantages:
- It avoids numerical underflow/overflow (see the sketch below);
- It is easier to work with. (For example, if $L_x$ is a product, then $l_x$ becomes a sum.)
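A short illustration (my own, with simulated data) of why the log transform matters numerically: the raw likelihood underflows to 0 for moderately many observations, while the log-likelihood stays a perfectly ordinary finite number.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=2000)

# raw likelihood: product of densities -> underflows to 0.0 in double precision
L = np.prod(norm.pdf(x, loc=0.0, scale=1.0))

# log-likelihood: sum of log-densities -> finite and easy to work with
l = np.sum(norm.logpdf(x, loc=0.0, scale=1.0))

print(L)   # 0.0 (underflow)
print(l)   # equals -n/2 * log(2*pi) - sum(x**2)/2
```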
Example (Gaussian):
Suppose $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0,\infty)$.
Assume $n \ge 2$, so that $\frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2 > 0$ a.s.
The log-likelihood function:
$$l_X(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2.$$
Maximizing over $\mu$ and $\sigma^2$, it is straightforward to obtain:
$$\hat{\mu} = \bar{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
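A sketch (my own, with simulated data) that checks the closed-form MLE against a direct numerical maximization of the Gaussian log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.normal(2.0, 3.0, size=500)
n = len(x)

def neg_log_lik(params):
    mu, log_sigma2 = params            # optimize log(sigma^2) to keep sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_num, sigma2_num = res.x[0], np.exp(res.x[1])

# closed-form MLE from the notes
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)

print(mu_num, mu_hat)            # should agree
print(sigma2_num, sigma2_hat)    # should agree
```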
Bayes Estimators
In the constructions so far, the estimates depend only on the observed data and are not influenced by any prior knowledge or experience of ours.
We now want to change this: the resulting estimates should depend both on the data and on our prior beliefs.
The workflow of Bayesian inference:
1. Treat the previously fixed parameter $\theta$ as a random variable and choose a prior distribution for it (our belief before observing the data).
2. Treat $P_\theta$ as the conditional distribution of $X$ given $\theta$.
3. After observing the data $x$, base statistical inference on the posterior distribution of $\theta$, i.e., the conditional distribution of $\theta$ given $X = x$.
Consider an observation modeled as $X\sim P_\theta, \theta \in \Theta \subseteq \mathbb{R}.$
Theorem (Bayes theorem): Suppose the prior distribution has density $\pi$ w.r.t. a measure $\nu$ and
$P_\theta \ll \nu \ \ \forall \theta$ with densities $p_\theta(x) = p(x \mid \theta)$.
Then the posterior distribution has density (w.r.t. $\nu$):
$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\, \pi(\theta)}{p(x)},$$
where
$$p(x) = \int_\Theta p(x \mid \theta)\, \pi(\theta)\, \nu(d\theta)$$
is the prior predictive density of $X$.
Bayes estimators of $\theta$ are obtained as characteristics of the posterior distribution.
Most frequently, one considers the posterior mean:
$$\hat{\theta}_{\mathrm{Bayes}}(x) = \mathbb{E}[\theta \mid X = x] = \int_\Theta \theta \, \pi(\theta \mid x)\, \nu(d\theta).$$
Example (Gaussian):
Assume $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2 > 0$ known. We select as prior distribution $\mu \sim \mathcal{N}(m, \tau^2)$, so
$$\pi(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left( -\frac{(\mu - m)^2}{2\tau^2} \right).$$
The likelihood function is equal to
$$L_X(\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right).$$
The posterior density is
$$p(\mu \mid X) \propto L_X(\mu)\, \pi(\mu) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 - \frac{(\mu - m)^2}{2\tau^2} \right).$$
We recognize that the posterior distribution will be a normal distribution.
More precisely,
$$p(\mu \mid X) \propto \exp\left( -a\mu^2 + 2b\mu \right),$$
where
$$a = \frac{n}{2\sigma^2} + \frac{1}{2\tau^2}, \qquad b = \frac{n \bar{X}_n}{2\sigma^2} + \frac{m}{2\tau^2}.$$
We conclude that since $p(\mu \mid X) \propto \exp\{ -a \, (\mu - b/a)^2 \}$,
it holds that $p(\mu \mid X)$ is the density of a normal distribution with mean and variance
$$\mathbb{E}[\mu \mid X] = \frac{b}{a} = \frac{n\tau^2 \bar{X}_n + \sigma^2 m}{n\tau^2 + \sigma^2}, \qquad \operatorname{Var}[\mu \mid X] = \frac{1}{2a} = \frac{\sigma^2 \tau^2}{n\tau^2 + \sigma^2}.$$
The posterior mean is a convex combination of $\overline{X}_n$ and the prior mean $m$.
If we set
$$\lambda = \frac{n\tau^2}{n\tau^2 + \sigma^2} \in (0,1),$$
then
$$\mathbb{E}[\mu \mid X] = \lambda \, \overline{X}_n + (1 - \lambda)\, m.$$
Note that as $n \to \infty$, $\lambda \to 1$. That is, the larger $n$ is, the more the posterior mean is determined by the data; conversely, the smaller $n$ is, the more it is determined by our prior knowledge.
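A small sketch (my own, with invented prior hyperparameters and simulated data) that computes the posterior mean and variance from the formulas above and shows the weight $\lambda$ approaching 1 as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(7)
mu_true, sigma2 = 2.0, 4.0        # sigma^2 is assumed known
m, tau2 = 0.0, 1.0                # prior: mu ~ N(m, tau^2)

for n in [5, 50, 500, 5000]:
    x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
    xbar = x.mean()

    lam = n * tau2 / (n * tau2 + sigma2)              # weight on the data
    post_mean = lam * xbar + (1 - lam) * m            # posterior mean
    post_var = sigma2 * tau2 / (n * tau2 + sigma2)    # posterior variance

    print(n, round(lam, 4), round(post_mean, 4), round(post_var, 6))
```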
Mean Square Error, Bias and Variance
Definition: The mean square error is defined as
$$\mathrm{MSE}_\theta[\hat{\theta}] := \mathbb{E}_\theta\big[ (\hat{\theta} - \theta)^2 \big].$$
Theorem: The mean square error decomposes as
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2,$$
where $\mathrm{Bias}_\theta[\hat{\theta}] := \mathbb{E}_\theta[\hat{\theta}] - \theta$ is the bias of $\hat{\theta}$.
Proof:
Write
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\Big[ \big( \hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] + \mathbb{E}_\theta[\hat{\theta}] - \theta \big)^2 \Big]$$
and expand:
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\big[ (\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}])^2 \big] + 2\,\big( \mathbb{E}_\theta[\hat{\theta}] - \theta \big)\, \underbrace{\mathbb{E}_\theta\big[ \hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] \big]}_{=0} + \big( \mathbb{E}_\theta[\hat{\theta}] - \theta \big)^2 = \mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2.$$
$\square$
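As a quick sanity check (not part of the original notes), a Monte Carlo experiment can verify the decomposition numerically, e.g., for the biased variance estimator $\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$:

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma2 = 10, 4.0
reps = 200_000

# estimator: hat sigma^2 = (1/n) * sum (X_i - Xbar)^2 (the biased MLE version)
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
est = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)

mse = np.mean((est - sigma2) ** 2)
bias = np.mean(est) - sigma2          # theory: -sigma^2 / n
var = np.var(est)

print(mse, var + bias ** 2)           # the two numbers should nearly coincide
```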